-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
Fix @fastmath x^2 inlining regression for Float32 and Float16
#60640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix @fastmath x^2 inlining regression for Float32 and Float16
#60640
Conversation
|
@Yashagarwal9798 please be aware that all uses of AI must be disclosed |
0d4ccb4 to
c79d1c7
Compare
c79d1c7 to
102b4a1
Compare
102b4a1 to
1664287
Compare
|
Build failure looks real? |
6a69d79 to
bd9c0e9
Compare
|
@oscardssmith CI is all green now. |
|
@Yashagarwal9798 thanks for the PR! |
|
Thanks! really glad I could help improve the project. |
|
This is a candidate for backporting. Although this concerns only an optimization, generating inefficient code for |
|
Agreed. If we backport, we need to be careful not to back-port the Float16 version of this to 1.12 (LLVM only recently doesn't produce garbage on x86 with this intrinsic). 1.13 should receive the back-port unmodified though. I'll put up the 1.12 PR. |
@fastmath x^2 inlining regression for Float32 and Float16
## Summary This PR fixes the performance regression where `@fastmath x^2` for `Float32` was not being inlined to efficient LLVM code, unlike `Float64`. ## Problem As reported in #60639, `@fastmath x^2` for `Float32` was falling back to `power_by_squaring` instead of using the LLVM `powi` intrinsic. This resulted in: - Unnecessary function calls instead of inline multiplication - Potential type promotion to `Float64` - Suboptimal generated code compared to `Float64` Before this fix, `@code_llvm @fastmath Float32(1.5)^2` would show calls to `power_by_squaring`, while `Float64` correctly used the `llvm.powi` intrinsic. ## Solution Added the missing `pow_fast` methods for `Float32` and `Float16`: - `pow_fast(::Float32, ::Int32)` - uses `llvm.powi.f32.i32` intrinsic directly - `pow_fast(::Float32, ::Integer)` - wrapper that converts to `Int32` when safe, matching the `Float64` pattern - `pow_fast(::Float16, ::Integer)` - converts to `Float32`, computes, and converts back This mirrors the existing implementation for `Float64` which already used `llvm.powi.f64.i32`. ## Testing Added a regression test that verifies `@fastmath x^2` generates inline `fmul` instructions (not `power_by_squaring` calls) for `Float16`, `Float32`, and `Float64`. Fixes #60639 --------- Co-authored-by: Oscar Smith <[email protected]> (cherry picked from commit f34d5f2)
Summary
This PR fixes the performance regression where
@fastmath x^2forFloat32was not being inlined to efficient LLVM code, unlikeFloat64.Problem
As reported in #60639,
@fastmath x^2forFloat32was falling back topower_by_squaringinstead of using the LLVMpowiintrinsic. This resulted in:Float64Float64Before this fix,
@code_llvm @fastmath Float32(1.5)^2would show calls topower_by_squaring, whileFloat64correctly used thellvm.powiintrinsic.Solution
Added the missing
pow_fastmethods forFloat32andFloat16:pow_fast(::Float32, ::Int32)- usesllvm.powi.f32.i32intrinsic directlypow_fast(::Float32, ::Integer)- wrapper that converts toInt32when safe, matching theFloat64patternpow_fast(::Float16, ::Integer)- converts toFloat32, computes, and converts backThis mirrors the existing implementation for
Float64which already usedllvm.powi.f64.i32.Testing
Added a regression test that verifies
@fastmath x^2generates inlinefmulinstructions (notpower_by_squaringcalls) forFloat16,Float32, andFloat64.Fixes #60639